Applying Pattern Mining to Web Information Extraction
Authors
Abstract
Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel approach to IE based on repeated pattern mining and multiple pattern alignment. Repeated patterns are discovered through a data structure called a PAT tree. In addition, incomplete patterns are further revised by pattern alignment so that they cover all pattern instances. This new approach to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules achieve 97 percent extraction over fourteen popular search engines.
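The repeated-pattern step can be illustrated with a minimal Python sketch. It is not the authors' implementation: a plain sorted list of suffixes stands in for the PAT tree, the tag-token encoding of the page is assumed, and the function name `repeated_patterns` and its parameters `min_len` and `min_count` are hypothetical. The multiple pattern alignment step is not covered.

```python
from collections import defaultdict


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Map each repeated token sequence to the sorted list of its start positions.

    Assumption: `tokens` is a page already encoded as a list of tag tokens.
    Sorted suffixes are used here as a simple stand-in for a PAT tree.
    """
    n = len(tokens)
    # Order suffix start positions lexicographically by the suffix they begin.
    order = sorted(range(n), key=lambda i: tokens[i:])
    found = defaultdict(set)
    for a, b in zip(order, order[1:]):
        # Longest common prefix of two adjacent suffixes.
        lcp = 0
        while a + lcp < n and b + lcp < n and tokens[a + lcp] == tokens[b + lcp]:
            lcp += 1
        # Every prefix of that common part occurs at both positions.
        for length in range(min_len, lcp + 1):
            found[tuple(tokens[a:a + length])].update((a, b))
    return {p: sorted(s) for p, s in found.items() if len(s) >= min_count}


# Hypothetical input: three result records sharing one tag layout.
tokens = ["<li>", "<a>", "TEXT", "</a>", "</li>"] * 3
for pattern, positions in repeated_patterns(tokens, min_len=5, min_count=3).items():
    print(" ".join(pattern), positions)
```

Because all suffixes that share a prefix sit next to each other in sorted order, comparing adjacent suffixes is enough to collect every occurrence of a repeated sequence. Overlapping occurrences are counted, which a real extractor would likely need to filter.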
Similar resources
Web Usage Mining: User Navigational Patterns Extraction from Web Logs
Web Usage Mining is the process of applying data mining techniques to the discovery of usage patterns from data extracted from Web Log files. In this paper, we define the notion of a “user session” as being a temporally compact sequence of web accesses by a user. We also define a new distance measure between two web sessions that captures the organization of a web site. Web usage mining consist...
Web Usage Mining: users' navigational patterns extraction from web logs using ant-based clustering method
Web Usage Mining is the process of applying data mining techniques to the discovery of usage patterns from data extracted from Web Log files. It mines the secondary data (web logs) derived from the users' interaction with the web pages during certain period of Web sessions. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. In this paper, w...
Prioritization of Domain-Specific Web Information Extraction
It is often desirable to extract structured information from raw web pages for better information browsing, query answering, and pattern mining. Many such Information Extraction (IE) technologies are costly and applying them at the web-scale is impractical. In this paper, we propose a novel prioritization approach where candidate pages from the corpus are ordered according to their expected con...
Automatic Acquisition of Similarity between Entities by Using Web Search Engine
Web mining is the application of data mining technology to discover patterns from the web. Its tasks include relation extraction, community mining, document clustering, and automatic metadata extraction. Previously proposed web-based semantic similarity measures show high correlation with human ratings on three benchmark datasets. One of the main problems in information retrie...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Publication date: 2001